Efficient Algorithms for Locating the Length-Constrained Heaviest Segments, with Applications to Biomolecular Sequence Analysis
Abstract
We study two fundamental problems concerning the search for interesting regions in sequences: (i) given a sequence of real numbers of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum, and (ii) given a sequence of real numbers of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average. We present an O(n)-time algorithm for the first problem and an O(n log L)-time algorithm for the second. The algorithms have potential applications in several areas of biomolecular sequence analysis, including locating GC-rich regions in a genomic DNA sequence, post-processing sequence alignments, annotating multiple sequence alignments, and computing length-constrained ungapped local alignments. Our preliminary tests on both simulated and real data demonstrate that the algorithms are very efficient and able to locate useful (for example, GC-rich) regions.
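To make the two problem statements concrete, the following Python sketch solves both on a small input. It is an illustration under stated assumptions, not the algorithms of the paper: problem (i) is handled with the standard prefix-sum and monotonic-deque sliding-window-minimum technique, which also runs in O(n) time, and problem (ii) with a simple O(nL) scan that relies on the fact that some optimal segment has length below 2L (a segment of length 2L or more can be split into two pieces of length at least L, one of which has an average at least as large). The function names and the half-open `(start, end)` return convention are hypothetical.

```python
from collections import deque
from itertools import accumulate


def max_sum_segment_at_most(a, U):
    """Problem (i): heaviest non-empty segment a[start:end] with end - start <= U.

    Returns (best_sum, start, end). Runs in O(n) with a monotonic deque;
    this is a textbook technique, not necessarily the paper's method.
    """
    if not a or U < 1:
        raise ValueError("need a non-empty sequence and U >= 1")
    prefix = [0] + list(accumulate(a))  # prefix[i] = a[0] + ... + a[i-1]
    best = (float("-inf"), 0, 0)
    window = deque()  # candidate start indices, prefix values strictly increasing
    for end in range(1, len(a) + 1):
        start = end - 1
        # Drop candidates that can never beat the new start index.
        while window and prefix[window[-1]] >= prefix[start]:
            window.pop()
        window.append(start)
        # Drop start indices that would make the segment longer than U.
        while window[0] < end - U:
            window.popleft()
        total = prefix[end] - prefix[window[0]]
        if total > best[0]:
            best = (total, window[0], end)
    return best


def max_average_segment_at_least(a, L):
    """Problem (ii): segment a[start:end] with end - start >= L and maximum average.

    Returns (best_average, start, end). Simple O(n * L) scan over lengths
    L .. 2L - 1; the paper reports a faster O(n log L) algorithm.
    """
    n = len(a)
    if not 1 <= L <= n:
        raise ValueError("need 1 <= L <= len(a)")
    prefix = [0] + list(accumulate(a))
    best = (float("-inf"), 0, 0)
    for start in range(n - L + 1):
        for end in range(start + L, min(n, start + 2 * L - 1) + 1):
            average = (prefix[end] - prefix[start]) / (end - start)
            if average > best[0]:
                best = (average, start, end)
    return best


scores = [2, -5, 3, 4, -1, 2, -6, 1]
print(max_sum_segment_at_most(scores, 3))       # segment [3, 4]
print(max_average_segment_at_least(scores, 2))  # segment [3, 4]
```

For GC-content analysis, one plausible encoding (again an assumption, not taken from the paper) is to score G and C as 1 and A and T as 0, so that the maximum-average segment of length at least L is the most GC-rich window of that minimum size.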
Similar resources
Definitions and Algorithms in SEGID
Given a (multiple) sequence alignment, SEGID first converts it into a sequence of numbers, where each number is the alignment score of a column. (SEGID also directly accepts a sequence of numbers as input.) Then it provides three algorithms to identify conserved segments (high score substrings): 1. Longest segment (with average value lower bound): given a string of numbers and a number A, find ...
Constrained Heaviest Segments in a Number Sequence and Their Applications in Biomolecular Sequence Analysis (Working Draft)
Algorithms for the Problems of Length-Constrained Heaviest Segments
We present algorithms for length-constrained maximum sum segment and maximum density segment problems, in particular, and the problem of finding length-constrained heaviest segments, in general, for a sequence of real numbers. Given a sequence of n real numbers and two real parameters L and U (L ≤ U), the maximum sum segment problem is to find a consecutive subsequence, called a segment, of len...
Genomic Sequence Analysis: A Case Study in Constrained Heaviest Segments (Working draft)
Methods for genomic sequence analysis have been studied for more than a decade. One line of investigation is to locate biologically meaningful segments, such as conserved regions or GC-rich regions in DNA sequences. A common approach is to assign a real number (also called a score) to each residue, and then look for the maximum-sum or maximum-average segment. In this chapter, we address a few i...
MAVG: locating non-overlapping maximum average segments in a given sequence
SUMMARY: MAVG is a software tool for finding k non-overlapping maximum-average segments that are sufficiently long in a given sequence of real numbers, for any k > 0. It has applications in several areas of biomolecular sequence analysis including locating GC-rich regions and CpG islands in a genomic sequence, and annotating multiple sequence alignments. AVAILABILITY: http://iubio.bio.indiana.e...
Publication date: 2002